tickets/SP-3008: Bug fixes and improved tools for comparison of completed visits and pre-night sims#196
tickets/SP-3008: Bug fixes and improved tools for comparison of completed visits and pre-night sims#196
Conversation
0d5e4b2 to
7c25abc
Compare
7c25abc to
85c963f
Compare
rhiannonlynne
left a comment
There was a problem hiding this comment.
I have comments, because some of the docstrings could be clearer and some were actually completely opaque as they are.
One thing in particular I had not taken into account was that you were putting together all of the different simulation visits, plus all of the completed visits, into one big data frame. This is not easy to tell from the information that is available in this module.
I suspect that it may be starting to make sense to make a class to handle "all of the data acquisition from a night", including fetching the given set of simulation visits, the completed visits, and then concatenating them in the way that currently only exists in the notebook. You later end up depending on having the data in this particular format, but it's not an obvious format and not created anywhere within this code base -- which to me, means that maybe this should be shifted to a class, with methods to fetch and manipulate the data, so that you can maintain the particular data format you want.
| from .multisim import match_visits_across_sims | ||
|
|
||
|
|
||
| def assign_field_hpids( |
There was a problem hiding this comment.
This probably needs a different name, to reduce confusion about what it does. "assign_field_hpids" doesn't imply matching, but one of the requirements for this function is two different data frames to match (and this is what drives the hpid assignment, if I'm getting it right).
Perhaps it could simply be "match_pointings_hpid" ?
There was a problem hiding this comment.
Or maybe "assign_matching_field_hpids"?
I don't know .. just assign_field_hpids is accurate but makes me think you could have done what you probably did first (just assign each pointing to the nearest hpid).
There was a problem hiding this comment.
How about I modify this to just assign each pointing to the nearest hpid, and then have a separate function, snap_to_referece_hpid_when_close, that changes assigned hpids to one those from a reference set when the coordinates are close, and then have the notebook call both in succession?
There was a problem hiding this comment.
By the time you get rid of the matching, there's really nothing left but the call to hp.ang2pix, so maybe we should just drop the assign_field_hpids function entierly and have the notebook call hp.ang2pix directly, followed by snap_to_reference_hpid_when_close.
There was a problem hiding this comment.
I think you're maybe just (at the end of the day) proposing the new name here then?
would then "assign_matching_reference_hpids" or something work?
There was a problem hiding this comment.
Mostly, with the difference that the initial setting of the id is by a direct call to hp.ang2pix in the notebook, and a separate call to snap to the reference. This does result in more being done in the notebook, but has the advantage that in the notebook it makes it clear what's going on. I'm happy either with your proposed renaming, or with pulling hp.ang2pix out.
schedview/compute/comparesim.py
Outdated
| simulated_visits : `pd.DataFrame` | ||
| Simulated visit table; must contain columns identified by ``ra_col`` | ||
| and ``dec_col`` (both in degrees). An integer column named | ||
| ``hpid_col`` will be added or overwritten. |
There was a problem hiding this comment.
These are the visit pointings that are being used as the basis for the match, correct? Might be useful to make that comment here.
There was a problem hiding this comment.
And I see (from the notebook) that it's actually the simulated visits from all of the simulations, not a single sim.
Are there failure modes in matching, if different simulations had slightly different field centres?
There was a problem hiding this comment.
If different simulations have different field centers, they will match if they have they are close enough to correspond to the same healpix, and not otherwise.
It's not clear to me whether this is actually a failure, though.
What we're trying to do here is to match hpids when a completed visit corresponds to the visit in a simulation.
If different simulations have different field centers, then the completed visit could have come from a scheduler request that matches at most one of them, and the simulation that it is nearest to seems like the best guess (but not guaranteed).
There was a problem hiding this comment.
An alternative would be to ignore dithering altogether: instead of matching to field centers in simulation, do the catalog match to the original undithered WFD tessellation, and match both coordinates in completed and all the simulations to these points. This would mean, however, that pointings could be a large fraction of a camera diameter off and still be considered a "match."
| and ``dec_col`` (both in degrees). An integer column named | ||
| ``hpid_col`` will be added or overwritten. | ||
|
|
||
| completed_visits : `pd.DataFrame` |
There was a problem hiding this comment.
Will the completed visits hpid be overwritten, when compared against a new simulation?
There was a problem hiding this comment.
If it is set already (and inplace=True), yes. I'll note that in the docstring.
| Completed visit table; same column requirements as | ||
| ``simulated_visits``. | ||
|
|
||
| field_hpix_nside : `int` |
There was a problem hiding this comment.
Is there a reasonable useful default for this?
I imagine that your testing has shown that some values tend to work pretty well, while others are overkill and others are way too coarse. I'm guessing that either 32 or at the outside 64 are best choices?
There was a problem hiding this comment.
I've been using 2*18=262144, or about 0.8" resolution, but highter would be just as good. My goal here was to make sure that it is high enough that simulated pointings that are different should have different hpids, and I wasn't concerned with it being too high because matching sims that match each other should have exactly matching coordinates.
schedview/compute/comparesim.py
Outdated
|
|
||
| field_coord_tolerance_deg : `float` | ||
| Maximum angular separation (deg) for a completed visit to be considered | ||
| a match to a simulated healpix center. |
There was a problem hiding this comment.
Likewise, is there a recommendation for this? I imagine it's on the order of the small offsets you've been noticing, which ought to hold up as long as the tessellation we're using matches the tessellation on the summit, for that night.
I imagine that the recommendation is something like
"On a typical night, where the field tessellation on the summit matches the tessellation in the simulation (for wide-area surveys), the tolerance is likely around 0.001 (??) degrees. For dithering around field centres and for mismatches in tessellation, the tolerance might be set larger, potentially up to 1.7 degrees." ( ??)
| within_tol = sim_hp_sep.deg < coord_match_tolerance_deg | ||
| completed_visits.loc[within_tol, hpid_col] = sim_hpids[sim_hp_match_idx[within_tol]] | ||
|
|
||
| return simulated_visits, completed_visits |
There was a problem hiding this comment.
I was wondering why we couldn't pass in multiple simulated_visits data frames at the same time, but I think it's because you're matching the completed visits to those simulated visits, and then the completed_visits can end up with different healpix ids depending on which simulation it's been matched against.
There was a problem hiding this comment.
As I think you noticed elsewhere, I pass visits from all simulations in the same DataFrame.
| The simulation index to compare against the observation (index 0). | ||
| visits : `pd.DataFrame` | ||
| Table of visits that contains at least the columns ``sim_index`` and | ||
| ``start_timestamp``. |
There was a problem hiding this comment.
I don't understand from this docstring what "visits" is .. it seems like it's not just the information from either the completed visits or a single simulation, but potentially both? How are they combined?
Is there some function somewhere you'd use to create the "visits" data frame and could that be referenced here?
There was a problem hiding this comment.
Yes, visits is the DataFrame with all visits, both from completed and simulated simulations.
| Parameters | ||
| ---------- | ||
| sim_index : `int` | ||
| The simulation index to compare against the observation (index 0). |
There was a problem hiding this comment.
What is sim_index? Is it part of how the "visits" data frame referenced next was created?
There was a problem hiding this comment.
It's an id that indentifies sequences of visits. That is, the visits DataFrame contains all visits, including the simulations and completed visits. The sim_index column of the visits dataframe identifies which sim (or completed) that specific visit came from.
| dec=completed_visits[decl_col].values * u.deg, | ||
| frame="icrs", | ||
| ) | ||
| sim_hp_match_idx, sim_hp_sep, _ = completed_coords.match_to_catalog_sky(sim_hp_coords) |
There was a problem hiding this comment.
Given that you're using a really high side by default for assigning the sim visits to healpix ids -- could there be a potential failure where:
sim 1 has field centres which are slightly different than the field centres in sim 2
then the completed visits would only match either the healpix in sim 1 or sim 2, but if you'd used a lower nside they would both have actually matched and both been within the tolerance?
There was a problem hiding this comment.
See my comment on the corresponding comment on the docstring. My intention was to distinguish dithers, so completed visits in one dither not match simulated visits in a different dither, but rather only accommodate differences between the coordinates requested by the FBS and the coordinates reported in ConsDB.
|
To be honest, I'm still a little confused about the matching, after seeing how you're actually fetching and using the data frame. |
I dislike the idea of putting this into a class because it interferes with code reuse. Most of what I want to do with the visits table involves Instead, I propose defining and documenting a python This will make From the point of view of the runtime python interpreter, it So, using Alternately, I am aware of pandera, a module for enforcing specific schema for data frames. I have never tried it, but would be open to it. |
|
I will move the construction of the |
|
I don't really like the creation of the NewType, because if you really want it to be a data frame for use in other places, then it should just be a data frame .. making it a specific type seems likely to lead to (at some point) relying on a specific property (either data value or behaviour) of the NewType .. and then it's not just a data frame anymore anyway. Ok - leave it as a pandas data frame, but indicate where the simulated_visits ought to be coming from (from querying for all of the simulated visits in the night) and then .. does it make sense to just combine the simulated and completed visits when returning from the assign_hpid_field_centers function? |
The reason why I proposed using Or, we can leave it as a So yes, the best answer may be just for me to add better docstrings to the functions that take the I am in the process of writing such functions now (moving more code from the notebooks to |
|
I moved some of the discussion about matching to slack here since maybe this is a question that would benefit from other viewpoints. |
This started off as a bug fix for comparenight being too picky about how it was matching simulated visits to visits in the prenight simulation, failing to match visits when the pointing was not in the same healpixel. This change replaces the simple healpix-based strategy with coordinate matching taking advantage of astropy's catalog matching tools.
While debugging, I also added several improved tools for showing tables and figures that compare simulated with actual visits. I think these greatly improve on what was in comparenight before. There will be a corresponding PR on schedview_notebooks (which will depend on this PR in schedview) shortly.